Unsupervised Learning

“Contrary to supervised learning, in unsupervised learning the model is given only a set of input data that is neither labeled nor classified and has no corresponding outputs; it must find interesting patterns and relationships among the inputs on its own. The sorting happens based on patterns, similarities, or even differences. Since the model is given unlabeled data, there is no error or supervisory signal to evaluate a potential solution.” link

Clustering

Clustering is a common unsupervised learning tool. Coming from GIS, clustering should feel intuitive, but instead of just nearness in physical distance, we can use other types of measures to group similar objects.

Clustering is useful in exploratory data analysis when you don’t yet know much about your data.

Types

  • Hard Clustering: each data point can belong to only one cluster.
  • Soft Clustering: each data point can belong to more than one cluster, usually with some probability of membership in each.
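To make the distinction concrete, here is a tiny sketch (all values hypothetical): a hard assignment picks the single nearest centroid, while a soft assignment converts distances into membership weights.

```r
# Two hypothetical 1-D centroids and one data point
centroids <- c(0, 10)
x <- 3

# Hard clustering: assign to the single nearest centroid
d <- abs(x - centroids)
hard <- which.min(d)        # cluster 1

# Soft clustering: turn distances into membership weights
# (inverse-distance weighting; fuzzy c-means uses a related idea)
w <- (1 / d) / sum(1 / d)
round(w, 2)                 # 0.7 0.3
```

Fuzzy c-means (e.g. e1071::cmeans) formalizes the soft version; the point here is only that the memberships sum to 1 instead of being all-or-nothing.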

Methods

Algorithm Selection

“Clustering is in the eye of the beholder!”

Clustering is a subjective task, and there can be more than one correct clustering algorithm. Every algorithm follows a different set of rules for defining the ‘similarity’ among data points. The most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally, unless there is a mathematical reason to prefer one clustering algorithm over another. An algorithm might work well on a particular dataset but fail for a different kind of dataset. DataCamp

K-means clustering

Let’s load some packages that we will be using

library(dplyr)
library(ggplot2)
library(readr)
library(lubridate)
library(GGally)

## here we are going to find the root_file 
root <- rprojroot::find_rstudio_root_file()
## we are going to set the data file path here
dataDir <- file.path(root, 'Data')

We will load the data we need, which comes from FiveThirtyEight’s Uber TLC FOIL response repository: six months of Uber pickups in New York City (April–September 2014), each with a timestamp, latitude, longitude, and dispatching base.

# Load the .csv files
apr14 <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-apr14.csv")
may14 <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-may14.csv")
jun14 <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-jun14.csv")
jul14 <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-jul14.csv")
aug14 <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-aug14.csv")
sep14 <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-sep14.csv")

uber14 <- bind_rows(apr14,may14,jun14,jul14,aug14,sep14)
glimpse(uber14)
## Observations: 4,534,327
## Variables: 4
## $ `Date/Time` <chr> "4/1/2014 0:11:00", "4/1/2014 0:17:00", "4/1/2014 0:…
## $ Lat         <dbl> 40.7690, 40.7267, 40.7316, 40.7588, 40.7594, 40.7383…
## $ Lon         <dbl> -73.9549, -74.0345, -73.9873, -73.9776, -73.9722, -7…
## $ Base        <chr> "B02512", "B02512", "B02512", "B02512", "B02512", "B…
uber14 <- uber14 %>% mutate(`Date/Time` = mdy_hms(`Date/Time`)) %>% 
    rename(DT = `Date/Time`) %>% 
    mutate(year = year(DT),
           month = month(DT),
           day = day(DT),
           weekday = wday(DT),
           hour = hour(DT),
           minute = minute(DT),
           second = second(DT))

The goal of k-means clustering is to minimize the total intra-cluster variation, also known as the within-cluster variation. Essentially, we want to minimize the sum of squared distances (in this case, Euclidean distances) between items and their corresponding centroids.

\[W(C_{k}) = \sum_{x_{i} \in C_{k}} (x_{i} - \mu_{k})^2\] where \(x_i\) is a data point belonging to cluster \(C_k\), and \(\mu_k\) is the mean value of the points assigned to cluster \(C_k\).
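To make the formula concrete, we can compute the within-cluster sum of squares by hand on simulated data (hypothetical values) and compare it to the withinss component that kmeans returns:

```r
set.seed(1)
pts <- data.frame(x = c(rnorm(20, 0), rnorm(20, 5)),
                  y = c(rnorm(20, 0), rnorm(20, 5)))
km <- kmeans(pts, centers = 2, nstart = 20)

# W(C_k): squared distances from each point to its cluster centroid
W <- sapply(1:2, function(k) {
    members <- as.matrix(pts[km$cluster == k, ])
    sum(sweep(members, 2, km$centers[k, ])^2)
})
all.equal(unname(W), unname(km$withinss))  # TRUE
```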

k-means is sensitive to its random starting conditions: depending on where the initial centroids get placed, you may get different results. Setting nstart = 20 runs the algorithm from 20 random starts and keeps the best solution.
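A quick illustration of that sensitivity on simulated data (hypothetical values): single-start runs can land in different local optima, while a larger nstart keeps the best of many random starts.

```r
set.seed(42)
pts <- matrix(rnorm(300), ncol = 2)

# Ten single-start runs: tot.withinss can differ from run to run
single <- replicate(10, kmeans(pts, centers = 5, nstart = 1)$tot.withinss)

# One multi-start run keeps the best (lowest) of 20 random starts
multi <- kmeans(pts, centers = 5, nstart = 20)$tot.withinss
multi <= min(single)  # usually TRUE
```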

samp14 <- uber14 %>%
    sample_frac(size = 0.01)

set.seed(20)
clusters <- kmeans(samp14 %>% select(Lat, Lon), centers = 5, nstart = 20)

samp14 <- samp14 %>% mutate(cluster = clusters$cluster)
samp14 %>% 
ggplot(aes(Lat, Lon, color = as.factor(cluster))) + 
    geom_point(alpha = 0.3, shape = 1, size = .5) +
    theme_minimal()

samp14 %>% 
ggplot(aes(Lat, Lon, color = as.factor(cluster))) + 
    geom_point(alpha = 0.3, shape = 1, size = .5) + 
    facet_wrap(~cluster) + theme_minimal()

library(tmap)
library(sf)
nybb <- sf::read_sf(file.path(dataDir, "gis/nybb"))
uberPoint <- sf::st_as_sf(samp14, coords = c("Lon", "Lat"))
tmap_mode("plot")
tm_shape(shp = nybb) + tm_polygons(col = "white") + 
    tm_shape(uberPoint) + tm_symbols(col = "cluster", scale = 0.5, size = 0.5) + 
    tm_facets(by = "cluster",free.coords = FALSE) + tm_legend(show = FALSE)

library(leaflet)
pal <- colorFactor(c("navy", "red", "green", "orange", "purple"), domain = 1:5)
m <- samp14 %>% 
leaflet() %>% setView(lng = -73.9976343, lat = 40.7411095, zoom = 11)
m %>% addProviderTiles(providers$CartoDB.Positron) %>% 
    addCircleMarkers(lng = ~Lon, lat = ~Lat, color = ~pal(cluster), 
                     radius = 2, stroke = FALSE, fillOpacity = 0.2)

Here is a calculation to chart the total within-cluster sum of squares against the number of clusters in a K-means solution. The output looks like a curve, and the bend (the “elbow”) suggests the value of K to choose.

kMax <- 13 # Maximal number of clusters
out <- purrr::map_dbl(1:kMax, 
               function(k){kmeans(samp14 %>% select(Lat, Lon), 
                                  centers = k)$tot.withinss})

data.frame(out, clusters = 1:kMax) %>% 
    ggplot(aes(x = clusters, y = out)) + geom_point() +
    geom_line() + xlab("Number of clusters K") +
       ylab("Total within-clusters sum of squares") + theme_minimal()

Or using the useful package

uberBest <- useful::FitKMeans(samp14 %>% select(Lat, Lon), max.clusters = 20, nstart = 20,
                seed = 1234)
useful::PlotHartigan(uberBest)

running kmeans with 6 clusters

kmeans(samp14 %>% select(Lat, Lon), centers = 6) -> k6

checking the size and centers of the clusters

k6$size
## [1]  5643 16710  2182  1380   443 18985
k6$centers
##        Lat       Lon
## 1 40.68755 -73.96399
## 2 40.76621 -73.97223
## 3 40.79587 -73.87511
## 4 40.66846 -73.76063
## 5 40.70085 -74.19881
## 6 40.73131 -73.99814

adding back labels to our dataset

samp14 <- samp14 %>% 
    mutate(cluster = as.factor(k6$cluster))

Let’s plot our results

ggally_mysmooth <- function(data, mapping, ...){
    ggplot(data = data, mapping=mapping) +
        geom_density(fill=NA)
}

ggpairs(samp14 %>% select(cluster, Lat, Lon), mapping = ggplot2::aes(color = cluster),
        lower = list(continuous = wrap("points", shape = 1, alpha = .25)),
        diag = list(continuous = ggally_mysmooth))

samp14 %>% ungroup() %>%
    group_by(cluster, month) %>%
    count() %>%
    ggplot(aes(x = month, y = n, color = factor(cluster))) +
    geom_line() +
    theme_minimal()

K Nearest Neighbor
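k-NN classifies a point by majority vote among its k nearest labeled neighbors in feature space. A minimal sketch on made-up 2-D data (all values hypothetical), using the same class::knn() interface we will use on the PLUTO data below:

```r
library(class)

set.seed(3)
# Two well-separated hypothetical groups of 10 points each
train <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
               matrix(rnorm(20, mean = 5), ncol = 2))
labels <- factor(rep(c("A", "B"), each = 10))

# Points sitting at each group's center should get that group's label
newpts <- rbind(c(0, 0), c(5, 5))
pred <- knn(train, newpts, labels, k = 3)
pred  # A B
```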

Let’s first read in the data

MN <- readr::read_csv(file.path(dataDir,"MN_Pluto.csv"))

Let’s take a few columns and filter to Commercial & Office Buildings (land use 05) and Open Space & Outdoor Recreation lots (land use 09)

MNSelect <- MN %>% select(numfloors, lotarea, assesstot, landuse, xcoord, ycoord) %>% 
    filter(landuse %in% c("05", "09") & complete.cases(.)) %>% 
    mutate(landuse = if_else(landuse == "05", "Comm", "Park"))

Now let’s scale our variables

MNScaled <- MNSelect %>% mutate(numfloors = scale(numfloors),
              lotarea = scale(lotarea),
              assesstot = scale(assesstot),
              landuse = as.character(landuse))
table(MNScaled$landuse)
## 
## Comm Park 
## 4420  460

There are still 4,880 rows, which might be too large for our analysis. For ease, let’s take a sample fraction

MNSamp <- MNScaled %>% group_by(landuse) %>% sample_frac(size = 0.5) %>% ungroup() 

Let’s visualize what this looks like so far, first with assesstot and numfloors

ggplot(MNSamp, aes(x = assesstot, y = numfloors, color = landuse)) + 
    geom_point() + theme_minimal()

Let’s now take a look at plotting xcoord and ycoord

ggplot(MNSamp, aes(x = xcoord, y = ycoord, color = landuse)) + 
    geom_point() + theme_minimal()

Now we are going to make the train test split of 70% train, 30% test.

smp <- sample(2, nrow(MNSamp), replace=TRUE, prob=c(0.7, 0.3))

We will now use this sample vector to split our data

trainDF <- MNSamp %>% filter(smp == 1) %>% 
    select(assesstot, numfloors)
testDF <-  MNSamp %>% filter(smp == 2) %>% 
    select(assesstot, numfloors)

trainLabel <- MNSamp %>% filter(smp == 1) %>% 
    select(landuse)
testLabel <- MNSamp %>% filter(smp == 2) %>% 
    select(landuse)
library(class)
dfPred <- knn(trainDF, testDF, trainLabel$landuse, k = 9)
library(gmodels)
CrossTable(testLabel$landuse,dfPred,prop.chisq = F)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  726 
## 
##  
##                   | dfPred 
## testLabel$landuse |      Comm |      Park | Row Total | 
## ------------------|-----------|-----------|-----------|
##              Comm |       641 |        13 |       654 | 
##                   |     0.980 |     0.020 |     0.901 | 
##                   |     0.979 |     0.183 |           | 
##                   |     0.883 |     0.018 |           | 
## ------------------|-----------|-----------|-----------|
##              Park |        14 |        58 |        72 | 
##                   |     0.194 |     0.806 |     0.099 | 
##                   |     0.021 |     0.817 |           | 
##                   |     0.019 |     0.080 |           | 
## ------------------|-----------|-----------|-----------|
##      Column Total |       655 |        71 |       726 | 
##                   |     0.902 |     0.098 |           | 
## ------------------|-----------|-----------|-----------|
## 
## 
testDF %>% bind_cols(testLabel) %>% 
    mutate(pred = dfPred) %>% 
    mutate(test = if_else(pred == landuse, "Correct", "Wrong")) -> output
output %>% ggplot(aes(x = assesstot, y = numfloors, color = test)) + geom_point() +
    facet_wrap( ~ landuse) + theme_minimal()

Let’s try with x and y coordinates

trainCoordDF <- MNSamp %>% filter(smp == 1) %>% 
    select(xcoord, ycoord)
testCoordDF <-  MNSamp %>% filter(smp == 2) %>% 
    select(xcoord, ycoord)

trainLabel <- MNSamp %>% filter(smp == 1) %>% 
    select(landuse)
testLabel <- MNSamp %>% filter(smp == 2) %>% 
    select(landuse)
library(class)
dfCoordPred <- knn(trainCoordDF, testCoordDF, trainLabel$landuse, k = 9)
library(gmodels)
CrossTable(testLabel$landuse,dfCoordPred,prop.chisq = F)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  726 
## 
##  
##                   | dfCoordPred 
## testLabel$landuse |      Comm |      Park | Row Total | 
## ------------------|-----------|-----------|-----------|
##              Comm |       644 |        10 |       654 | 
##                   |     0.985 |     0.015 |     0.901 | 
##                   |     0.920 |     0.385 |           | 
##                   |     0.887 |     0.014 |           | 
## ------------------|-----------|-----------|-----------|
##              Park |        56 |        16 |        72 | 
##                   |     0.778 |     0.222 |     0.099 | 
##                   |     0.080 |     0.615 |           | 
##                   |     0.077 |     0.022 |           | 
## ------------------|-----------|-----------|-----------|
##      Column Total |       700 |        26 |       726 | 
##                   |     0.964 |     0.036 |           | 
## ------------------|-----------|-----------|-----------|
## 
## 
testCoordDF %>% bind_cols(testLabel) %>% 
    mutate(pred = dfCoordPred) %>% 
    mutate(test = if_else(pred == landuse, "Correct", "Wrong")) -> output
output %>% ggplot(aes(x = xcoord, y = ycoord, color = test)) + geom_point() +
    facet_wrap( ~ landuse) + theme_minimal()

Hierarchical Clustering

Hierarchical clustering builds a nested tree of clusters (a dendrogram) from a pairwise distance matrix, so we do not have to pick the number of clusters in advance; we can cut the tree wherever looks reasonable. Here we summarize the PLUTO data by council district, scale the variables, and cluster the districts using Euclidean distance with average linkage.

MNcouncilScaled <- MN %>% 
    filter(complete.cases(council,lotarea,comarea,resarea,garagearea,builtfar,facilfar,yearbuilt)) %>%   
    group_by(council) %>% 
    summarise(comarea = sum(comarea),
              resarea = sum(resarea),
              garagearea = sum(garagearea),
              builtfar = sum(builtfar),
              facilfar = sum(facilfar),
              yearbuilt = median(yearbuilt)) %>% 
    ungroup() %>% 
    mutate(council = as.character(council)) %>% 
    mutate_if(is.numeric, scale)
dist_mat <- dist(MNcouncilScaled %>% select(-council), method = 'euclidean')
hclust_avg <- hclust(dist_mat, method = 'average')
plot(hclust_avg)

cut_avg <- cutree(hclust_avg, k = 5)
plot(hclust_avg)
rect.hclust(hclust_avg , k = 5, border = 1:5)

library(dendextend)
avg_dend_obj <- as.dendrogram(hclust_avg)
avg_col_dend <- color_branches(avg_dend_obj,k = 5)
plot(avg_col_dend)

MNcouncilScaled <- MNcouncilScaled %>% mutate(cluster = cut_avg)
count(MNcouncilScaled, cluster)
## # A tibble: 5 x 2
##   cluster     n
##     <int> <int>
## 1       1     1
## 2       2     6
## 3       3     1
## 4       4     1
## 5       5     1
MNcouncilScaled %>% 
    ggplot(aes(x=comarea, y = resarea, color = factor(cluster))) + 
    geom_point() + 
    theme_minimal()

MNcouncilScaled %>% tidyr::gather("varY", "val", -c(council,resarea, cluster)) %>% 
    ggplot(aes(x=val, y = resarea, color = factor(cluster))) + 
    geom_point() + 
    geom_text(aes(label= council),  check_overlap = TRUE, nudge_y = .1, nudge_x = .1) + 
    facet_wrap(~varY) +
    theme_minimal()